def get_scaling_factor_1(x,cutoff=0.05):
''' x is a normlized cell '''
=x.astype('float16') #To prevent error in numerical calculation
x=np.array(x.value_counts().sort_index().head(2).index)
yif np.abs(y[0]-y[1]/2)>10**-5:
raise
return 1/y[0]
Functions to denormlize scRNA-seq data
Functions
denorm
denorm (ad:anndata._core.anndata.AnnData, gxc:bool=False, base:float=None, cont:float=1.0, rint:bool=True, method:str='Top2', float16:bool=False, cutoff:float=0.05, logger: <module'logging'from'/opt/hostedtoolcache/Python/3.10.15/x64/lib/ python3.10/logging/__init__.py'>=None)
denormalize takes a cell gene expression matrix that has been normalized according to single-cell RNA sequencing preprocessing procedure and return a recovered count matrix by reversing the logarithmization step and then the total-count normalization step utilizing the unscale function. If the imput matrix is not normalized by first total-count scaling then logarithmization, then the function produces an error indicating so. We also handle matrices that have not been logarithmized. denormalize: csr_matrix -> csr_matrix*
Type | Default | Details | |
---|---|---|---|
ad | AnnData | The input AnnData | |
gxc | bool | False | Change to True if the data is stored with gene by cell |
base | float | None | Give the base if it is known or auto |
cont | float | 1.0 | The constant plused after scaling |
rint | bool | True | Round elements of the result to the nearest integer |
method | str | Top2 | Top2 or Reg |
float16 | bool | False | |
cutoff | float | 0.05 | |
logger | logging | None |
scdenorm
scdenorm (fin:str, fout:str=None, by:str=None, gxc:bool=False, base:float=None, cont:float=1.0, rint:bool=True, method:str='Top2', float16:bool=False, cutoff:float=0.05, verbose:int=0)
denormalize takes a cell gene expression matrix that has been normalized according to single-cell RNA sequencing preprocessing procedure and return a recovered count matrix by reversing the logarithmization step and then the total-count normalization step utilizing the unscale function. If the input matrix is not normalized by first total-count scaling then logarithmization, then the function produces an error indicating so. We also handle matrices that have not been logarithmized. denormalize: csr_matrix -> csr_matrix*
Type | Default | Details | |
---|---|---|---|
fin | str | The input file or AnnData | |
fout | str | None | The path of output file if provided |
by | str | None | Split AnnData by sample or batch |
gxc | bool | False | Change to True if the data is stored with gene by cell |
base | float | None | Give the base if it is known or auto |
cont | float | 1.0 | The constant plused after scaling |
rint | bool | True | Round elements of the result to the nearest integer |
method | str | Top2 | Top2 or Reg |
float16 | bool | False | Coverting value to float16 for preventing error in numerical calculation |
cutoff | float | 0.05 | |
verbose | int | 0 |
solve_bc
solve_bc (p, y1, y2)
auto_detect
auto_detect (smtx, cutoff=1e-06)
solve_s
solve_s (s, C, X)
get_scaling_factor_by_reg
get_scaling_factor_by_reg (x, cutoff=0.05, f16=False)
get_scaling_factor takes a cell vector and its index in the gene expression matrix that has been scaled according to single-cell RNA sequencing preprocessing procedure and return the scaling factor for the cell, or produce an error indicating the cell has not been processed by our assumption.
get_scaling_factor_by_top2
get_scaling_factor_by_top2 (x, cutoff=0.05, f16=False)
get_scaling_factor takes a cell vector and its index in the gene expression matrix that has been scaled according to single-cell RNA sequencing preprocessing procedure and return the scaling factor for the cell, or produce an error indicating the cell has not been processed by our assumption.
get_scaling_factor
get_scaling_factor (x, cutoff=0.05, method='Top2', f16=False)
get_scaling_factor takes a cell vector and its index in the gene expression matrix that has been scaled according to single-cell RNA sequencing preprocessing procedure and return the scaling factor for the cell, or produce an error indicating the cell has not been processed by our assumption. get_scaling_factor: ndarray Nat (Num) -> Num
check_unscale
check_unscale (x, base=2.718281828459045, cont=1, cutoff=0.05, method='Top2', f16=False, plot=True)
select_base
select_base (x, cont=1, cutoff=0.05, method='Top2', f16=False, plot=False)
unscale_mat
unscale_mat (smtx, base=2.718281828459045, cont=1, cutoff=0.05, rint=True, method='Top2', f16=False, gpu=False)
unscale takes a cell gene expression matrix that has been quality-controlled and scaled according to single-cell RNA sequencing preprocessing procedure and return a recovered count matrix by finding the scaling factor for each cell, or produce an error indicating the matrix has not been processed by our assumption. unscale: csr_matrix -> csr_matrix*
def get_scaling_factor_2(x,cutoff=0.05):
"""
get_scaling_factor takes a cell vector and its index in the gene expression matrix
that has been scaled according to single-cell RNA sequencing preprocessing procedure
and return the scaling factor for the cell,
or produce an error indicating the cell has not been processed by our assumption.
get_scaling_factor: ndarray Nat (Num) -> Num
"""
=x.astype('float16')
x=pd.Series(x)
x=x.min()
xm=x/xm
xif np.abs(x-x.round()).mean()>cutoff:
raise
return 1/xm
check_plot
check_plot (c, idx, n=10)
Check_plot takes a cell vector and its index in the gene expression matrix and produce a plot of the first N most frequent values against their ranks. Such a plot is used for error-checking in the unscaling process.
create test data
=sc.datasets.pbmc3k() ad
/home/huang_yin/anaconda3/envs/sc/lib/python3.9/site-packages/anndata/__init__.py:51: FutureWarning: `anndata.read` is deprecated, use `anndata.read_h5ad` instead. `ad.read` will be removed in mid 2024.
warnings.warn(
=ad[:10].copy() ad
ad
AnnData object with n_obs × n_vars = 10 × 32738
var: 'gene_ids'
=1e4)
sc.pp.normalize_total(ad, target_sum#sc.pp.log1p(ad)
= ad.X.tocsr().asfptype() smtx
= ad.X.tocsr().asfptype() smtx
=smtx scaled_counts
=0,0.001
i,cutoff get_scaling_factor(scaled_counts.getrow(i).data, cutoff)
0.24209999655982023
=[],[]
scaling_factors,success_cellsfor i in tqdm(range(scaled_counts.shape[0])):
try:
scaling_factors.append(get_scaling_factor(scaled_counts.getrow(i).data,cutoff))
success_cells.append(i)except:
f"Warning:The cell {i} fails to denormlize, and be deleted") logging.warning(
100%|██████████| 3593700/3593700 [20:53<00:00, 2867.61it/s]
CPU times: user 20min 23s, sys: 33.2 s, total: 20min 56s
Wall time: 20min 53s
=[],[]
scaling_factors,success_cellsfor i in tqdm(range(scaled_counts.shape[0])):
try:
scaling_factors.append(get_scaling_factor_2(scaled_counts.getrow(i).data,cutoff))
success_cells.append(i)except:
f"Warning:The cell {i} fails to denormlize, and be deleted") logging.warning(
100%|██████████| 2700/2700 [00:01<00:00, 2064.87it/s]
CPU times: user 1.31 s, sys: 11.2 ms, total: 1.32 s
Wall time: 1.31 s
= diags(scaling_factors)
scaling_factors = scaling_factors*scaled_counts counts
CPU times: user 38.4 s, sys: 18.2 s, total: 56.6 s
Wall time: 56.7 s
counts
<3593700x32738 sparse matrix of type '<class 'numpy.float64'>'
with 3043842604 stored elements in Compressed Sparse Row format>
counts.data
array([4. , 1. , 9.99999954, ..., 1. , 1. ,
1. ])
counts.data
array([4. , 1. , 9.99999954, ..., 1. , 1. ,
1. ])
scaled_counts
<326700x32738 sparse matrix of type '<class 'numpy.float32'>'
with 276712964 stored elements in Compressed Sparse Row format>
counts.data
array([3.59520866, 0.89880217, 8.98802125, ..., 0.34156169, 0.34156169,
0.34156169])
smtx.data
array([1.6352079, 1.6352079, 2.2258174, ..., 1.7980369, 1.7980369,
2.779648 ], dtype=float32)
=smtx.getrow(0).data
x
=pd.Series(x).value_counts().sort_index()
x
=np.exp(np.array(x.index))-1
x
=x/(x[0]+0.1)
x
abs(x-x.round()).mean() np.
0.24593542762363435
2]==x.sort_values(ascending=False).index[:2]) np.alltrue(x.index[:
True
==pd.Series(x).value_counts().sort_values(ascending=False).index pd.Series(x).value_counts().sort_index().index
array([ True, True, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, True, False, False])
Testing get_scaling_factor
=smtx.getrow(1).data x
x
array([ 4.0791354, 2.0395677, 2.0395677, ..., 2.0395677, 16.316542 ,
2.0395677], dtype=float32)
='Top2') get_scaling_factor(x,method
0.4904214559386973
='Reg') get_scaling_factor(x,method
0.49042145712545715
get_scaling_factor_by_top2(x)
0.4904214559386973
get_scaling_factor_by_reg(x)
0.49042145712545715
Testing check_unscale
=ad1.X.getrow(0).data x
='Top2') check_unscale(x,method
True
='Reg') check_unscale(x,method
True
=2) check_unscale(x,base
ERROR:root:The base 2 is not match with the data.
False
Testing scdenorm
'data/scaled.mtx', smtx[1:10,]) mmwrite(
=sc.read('data/scaled.mtx') ad1
ad1
AnnData object with n_obs × n_vars = 9 × 32738
'scaled_tran.mtx', smtx.transpose()) mmwrite(
1.Testing unscale_mat
- input is a sparse matrix with cell by gene
- output is a sparse matrix after denormalization
='Top2') unscale_mat(ad1.X,method
100%|██████████| 9/9 [00:00<00:00, 652.27it/s]
(<9x32738 sparse matrix of type '<class 'numpy.float64'>'
with 7403 stored elements in Compressed Sparse Row format>,
[0, 1, 2, 3, 4, 5, 6, 7, 8])
='Reg') unscale_mat(ad1.X,method
0%| | 0/9 [00:00<?, ?it/s]WARNING:root:Warning:The cell 3 fails to denormlize, and be deleted
100%|██████████| 9/9 [00:00<00:00, 201.71it/s]
(<8x32738 sparse matrix of type '<class 'numpy.float64'>'
with 6881 stored elements in Compressed Sparse Row format>,
[0, 1, 2, 4, 5, 6, 7, 8])
2.Testing scdenorm with right base
'data/scaled.mtx',fout='data/scd_scaled.h5ad',base=np.e, verbose=1,cutoff=0.05,method='Top2') scdenorm(
INFO:my_logger:Reading input file: data/scaled.mtx
INFO:my_logger:The dimensions of this data are (9, 32738).
INFO:my_logger:Denormlizing ...the base is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 899.38it/s]
INFO:my_logger:Writing output file: data/scd_scaled.h5ad
'data/scaled.mtx',fout='data/scd_scaled.h5ad',base=np.e, verbose=0,cutoff=0.05,method='Reg') scdenorm(
0%| | 0/9 [00:00<?, ?it/s]WARNING:my_logger:Warning:The cell 3 fails to denormlize, and be deleted
100%|██████████| 9/9 [00:00<00:00, 219.13it/s]
3.Testing scdenorm with wrong base
'data/scaled.mtx',fout='data/scd_scaled.h5ad',base=2, verbose=1) scdenorm(
INFO:my_logger:Reading input file: data/scaled.mtx
INFO:my_logger:The dimensions of this data are (9, 32738).
INFO:my_logger:Failed to obtain scale factor. Error is 0.23859739923659193. x is [ 1. 1.79478553 2.49621531 3.13708999 3.73423045 4.29941127
4.83767872 5.35576114 5.85365854 6.33809924 6.80571909 7.25988225
7.71068124 8.14129521 8.57190917 8.98906644 9.39949537 9.80319596
10.20016821 10.97392767 11.35071489 11.7275021 12.09083263 12.81749369
13.17409588 13.52396972 13.87384357 14.22371741 14.56013457 14.89655172
15.23296888 15.56938604 15.89234651 16.21530698 16.53826745 17.18418839
17.49369218 17.80319596 18.42220353 18.73170732 19.0412111 20.81749369
21.11354079 21.40958789 22.25735913 23.38772077 23.93944491 27.41126997
28.9587889 31.67703953 32.8881413 43.06139613]
INFO:my_logger:The base 2 is not match with the data.
ERROR:my_logger:Denormlization has failed. Output the orignal data
INFO:my_logger:Writing output file: data/scd_scaled.h5ad
4.Testing scdenorm without base
'data/scaled.mtx',fout='data/scd_scaled.h5ad',verbose=1) scdenorm(
INFO:my_logger:Reading input file: data/scaled.mtx
INFO:my_logger:The dimensions of this data are (9, 32738).
INFO:my_logger:Selecting base
INFO:my_logger:Denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 1280.36it/s]
INFO:my_logger:Writing output file: data/scd_scaled.h5ad
'data/scaled.mtx', verbose=1) scdenorm(
INFO:root:Reading input file: data/scaled.mtx
INFO:root:The dimensions of this data are (9, 32738).
INFO:root:select base
INFO:root:denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 990.42it/s]
AnnData object with n_obs × n_vars = 9 × 32738
=1) scdenorm(ad1, verbose
INFO:root:The dimensions of this data are (9, 32738).
INFO:root:select base
INFO:root:denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 960.72it/s]
AnnData object with n_obs × n_vars = 9 × 32738
4.Testing scdenorm with auto detection
=sc.datasets.pbmc3k() ad
/home/huang_yin/anaconda3/envs/sc/lib/python3.9/site-packages/anndata/__init__.py:51: FutureWarning: `anndata.read` is deprecated, use `anndata.read_h5ad` instead. `ad.read` will be removed in mid 2024.
warnings.warn(
=ad[:200].copy() ad
ad
AnnData object with n_obs × n_vars = 200 × 32738
var: 'gene_ids'
=1e4)
sc.pp.normalize_total(ad, target_sum sc.pp.log1p(ad)
=1) scdenorm(ad,verbose
INFO:my_logger:The dimensions of this data are (200, 32738).
INFO:my_logger:Selecting base
INFO:my_logger:Denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 200/200 [00:00<00:00, 1783.47it/s]
AnnData object with n_obs × n_vars = 200 × 32738
var: 'gene_ids'
uns: 'log1p'
='auto',verbose=1) scdenorm(ad,base
INFO:my_logger:The dimensions of this data are (200, 32738).
INFO:my_logger:Denormlizing ...the base is 2.718281957850356
fun: 1.045837662921792e-10
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([5.11726853e-06, 1.49873895e-06])
message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 99
nit: 31
njev: 33
status: 0
success: True
x: array([2.71828196, 0.99999995])
100%|██████████| 200/200 [00:00<00:00, 1013.93it/s]
AnnData object with n_obs × n_vars = 200 × 32738
var: 'gene_ids'
uns: 'log1p'
=auto_detect(ad.X,1e-3) base,cont
fun: 6.749703574420865e-05
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([-0.28310845, -0.04961504])
message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 81
nit: 26
njev: 27
status: 0
success: True
x: array([2.71767589, 1.00073907])
base
2.717675894469142