Functions to denormlize scRNA-seq data

The main functions of denormlizing the normlized scRNA-seq data

Functions


source

denorm

 denorm (ad:anndata._core.anndata.AnnData, gxc:bool=False,
         base:float=None, cont:float=1.0, rint:bool=True,
         method:str='Top2', float16:bool=False, cutoff:float=0.05, logger:
         <module'logging'from'/opt/hostedtoolcache/Python/3.10.14/x64/lib/
         python3.10/logging/__init__.py'>=None)

denormalize takes a cell gene expression matrix that has been normalized according to single-cell RNA sequencing preprocessing procedure and return a recovered count matrix by reversing the logarithmization step and then the total-count normalization step utilizing the unscale function. If the imput matrix is not normalized by first total-count scaling then logarithmization, then the function produces an error indicating so. We also handle matrices that have not been logarithmized. denormalize: csr_matrix -> csr_matrix*

Type Default Details
ad AnnData The input AnnData
gxc bool False Change to True if the data is stored with gene by cell
base float None Give the base if it is known or auto
cont float 1.0 The constant plused after scaling
rint bool True Round elements of the result to the nearest integer
method str Top2 Top2 or Reg
float16 bool False
cutoff float 0.05
logger logging None

source

scdenorm

 scdenorm (fin:str, fout:str=None, by:str=None, gxc:bool=False,
           base:float=None, cont:float=1.0, rint:bool=True,
           method:str='Top2', float16:bool=False, cutoff:float=0.05,
           verbose:int=0)

denormalize takes a cell gene expression matrix that has been normalized according to single-cell RNA sequencing preprocessing procedure and return a recovered count matrix by reversing the logarithmization step and then the total-count normalization step utilizing the unscale function. If the input matrix is not normalized by first total-count scaling then logarithmization, then the function produces an error indicating so. We also handle matrices that have not been logarithmized. denormalize: csr_matrix -> csr_matrix*

Type Default Details
fin str The input file or AnnData
fout str None The path of output file if provided
by str None Split AnnData by sample or batch
gxc bool False Change to True if the data is stored with gene by cell
base float None Give the base if it is known or auto
cont float 1.0 The constant plused after scaling
rint bool True Round elements of the result to the nearest integer
method str Top2 Top2 or Reg
float16 bool False Coverting value to float16 for preventing error in numerical calculation
cutoff float 0.05
verbose int 0

source

solve_bc

 solve_bc (p, y1, y2)

source

auto_detect

 auto_detect (smtx, cutoff=1e-06)

source

solve_s

 solve_s (s, C, X)

source

get_scaling_factor_by_reg

 get_scaling_factor_by_reg (x, cutoff=0.05, f16=False)

get_scaling_factor takes a cell vector and its index in the gene expression matrix that has been scaled according to single-cell RNA sequencing preprocessing procedure and return the scaling factor for the cell, or produce an error indicating the cell has not been processed by our assumption.


source

get_scaling_factor_by_top2

 get_scaling_factor_by_top2 (x, cutoff=0.05, f16=False)

get_scaling_factor takes a cell vector and its index in the gene expression matrix that has been scaled according to single-cell RNA sequencing preprocessing procedure and return the scaling factor for the cell, or produce an error indicating the cell has not been processed by our assumption.


source

get_scaling_factor

 get_scaling_factor (x, cutoff=0.05, method='Top2', f16=False)

get_scaling_factor takes a cell vector and its index in the gene expression matrix that has been scaled according to single-cell RNA sequencing preprocessing procedure and return the scaling factor for the cell, or produce an error indicating the cell has not been processed by our assumption. get_scaling_factor: ndarray Nat (Num) -> Num


source

check_unscale

 check_unscale (x, base=2.718281828459045, cont=1, cutoff=0.05,
                method='Top2', f16=False, plot=True)

source

select_base

 select_base (x, cont=1, cutoff=0.05, method='Top2', f16=False,
              plot=False)

source

unscale_mat

 unscale_mat (smtx, base=2.718281828459045, cont=1, cutoff=0.05,
              rint=True, method='Top2', f16=False, gpu=False)

unscale takes a cell gene expression matrix that has been quality-controlled and scaled according to single-cell RNA sequencing preprocessing procedure and return a recovered count matrix by finding the scaling factor for each cell, or produce an error indicating the matrix has not been processed by our assumption. unscale: csr_matrix -> csr_matrix*

def get_scaling_factor_1(x,cutoff=0.05):
    ''' x is a normlized cell '''
    x=x.astype('float16') #To prevent error in numerical calculation
    y=np.array(x.value_counts().sort_index().head(2).index)
    if np.abs(y[0]-y[1]/2)>10**-5:
        raise
    return 1/y[0]
def get_scaling_factor_2(x,cutoff=0.05):
    """
    get_scaling_factor takes a cell vector and its index in the gene expression matrix 
    that has been scaled according to single-cell RNA sequencing preprocessing procedure 
    and return the scaling factor for the cell,
    or produce an error indicating the cell has not been processed by our assumption.
    get_scaling_factor: ndarray Nat (Num) -> Num
    """
    x=x.astype('float16')
    x=pd.Series(x)
    xm=x.min()
    x=x/xm
    if np.abs(x-x.round()).mean()>cutoff:
        raise 
    return 1/xm

source

check_plot

 check_plot (c, idx, n=10)

Check_plot takes a cell vector and its index in the gene expression matrix and produce a plot of the first N most frequent values against their ranks. Such a plot is used for error-checking in the unscaling process.

create test data

ad=sc.datasets.pbmc3k()
/home/huang_yin/anaconda3/envs/sc/lib/python3.9/site-packages/anndata/__init__.py:51: FutureWarning: `anndata.read` is deprecated, use `anndata.read_h5ad` instead. `ad.read` will be removed in mid 2024.
  warnings.warn(
ad=ad[:10].copy()
ad
AnnData object with n_obs × n_vars = 10 × 32738
    var: 'gene_ids'
sc.pp.normalize_total(ad, target_sum=1e4)
#sc.pp.log1p(ad)
smtx = ad.X.tocsr().asfptype()
smtx = ad.X.tocsr().asfptype()
scaled_counts=smtx
i,cutoff=0,0.001
get_scaling_factor(scaled_counts.getrow(i).data, cutoff)
0.24209999655982023
scaling_factors,success_cells=[],[]
for i in tqdm(range(scaled_counts.shape[0])):
    try:
        scaling_factors.append(get_scaling_factor(scaled_counts.getrow(i).data,cutoff))
        success_cells.append(i)
    except:
        logging.warning(f"Warning:The cell {i} fails to denormlize, and be deleted")
100%|██████████| 3593700/3593700 [20:53<00:00, 2867.61it/s]
CPU times: user 20min 23s, sys: 33.2 s, total: 20min 56s
Wall time: 20min 53s
scaling_factors,success_cells=[],[]
for i in tqdm(range(scaled_counts.shape[0])):
    try:
        scaling_factors.append(get_scaling_factor_2(scaled_counts.getrow(i).data,cutoff))
        success_cells.append(i)
    except:
        logging.warning(f"Warning:The cell {i} fails to denormlize, and be deleted")
100%|██████████| 2700/2700 [00:01<00:00, 2064.87it/s]
CPU times: user 1.31 s, sys: 11.2 ms, total: 1.32 s
Wall time: 1.31 s
scaling_factors = diags(scaling_factors)
counts = scaling_factors*scaled_counts
CPU times: user 38.4 s, sys: 18.2 s, total: 56.6 s
Wall time: 56.7 s
counts
<3593700x32738 sparse matrix of type '<class 'numpy.float64'>'
    with 3043842604 stored elements in Compressed Sparse Row format>
counts.data
array([4.        , 1.        , 9.99999954, ..., 1.        , 1.        ,
       1.        ])
counts.data
array([4.        , 1.        , 9.99999954, ..., 1.        , 1.        ,
       1.        ])
scaled_counts
<326700x32738 sparse matrix of type '<class 'numpy.float32'>'
    with 276712964 stored elements in Compressed Sparse Row format>
counts.data
array([3.59520866, 0.89880217, 8.98802125, ..., 0.34156169, 0.34156169,
       0.34156169])
smtx.data
array([1.6352079, 1.6352079, 2.2258174, ..., 1.7980369, 1.7980369,
       2.779648 ], dtype=float32)
x=smtx.getrow(0).data

x=pd.Series(x).value_counts().sort_index()

x=np.exp(np.array(x.index))-1

x=x/(x[0]+0.1)

np.abs(x-x.round()).mean()
0.24593542762363435
np.alltrue(x.index[:2]==x.sort_values(ascending=False).index[:2])
True
pd.Series(x).value_counts().sort_index().index==pd.Series(x).value_counts().sort_values(ascending=False).index
array([ True,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False,  True, False, False])

Testing get_scaling_factor

x=smtx.getrow(1).data
x
array([ 4.0791354,  2.0395677,  2.0395677, ...,  2.0395677, 16.316542 ,
        2.0395677], dtype=float32)
get_scaling_factor(x,method='Top2')
0.4904214559386973
get_scaling_factor(x,method='Reg')
0.49042145712545715
get_scaling_factor_by_top2(x)
0.4904214559386973
get_scaling_factor_by_reg(x)
0.49042145712545715

Testing check_unscale

x=ad1.X.getrow(0).data
check_unscale(x,method='Top2')
True
check_unscale(x,method='Reg')
True
check_unscale(x,base=2)
ERROR:root:The base 2 is not match with the data.
False

Testing scdenorm

mmwrite('data/scaled.mtx', smtx[1:10,])
ad1=sc.read('data/scaled.mtx')
ad1
AnnData object with n_obs × n_vars = 9 × 32738
mmwrite('scaled_tran.mtx', smtx.transpose())

1.Testing unscale_mat

  • input is a sparse matrix with cell by gene
  • output is a sparse matrix after denormalization
unscale_mat(ad1.X,method='Top2')
100%|██████████| 9/9 [00:00<00:00, 652.27it/s]
(<9x32738 sparse matrix of type '<class 'numpy.float64'>'
    with 7403 stored elements in Compressed Sparse Row format>,
 [0, 1, 2, 3, 4, 5, 6, 7, 8])
unscale_mat(ad1.X,method='Reg')
  0%|          | 0/9 [00:00<?, ?it/s]WARNING:root:Warning:The cell 3 fails to denormlize, and be deleted
100%|██████████| 9/9 [00:00<00:00, 201.71it/s]
(<8x32738 sparse matrix of type '<class 'numpy.float64'>'
    with 6881 stored elements in Compressed Sparse Row format>,
 [0, 1, 2, 4, 5, 6, 7, 8])

2.Testing scdenorm with right base

scdenorm('data/scaled.mtx',fout='data/scd_scaled.h5ad',base=np.e, verbose=1,cutoff=0.05,method='Top2')
INFO:my_logger:Reading input file: data/scaled.mtx
INFO:my_logger:The dimensions of this data are (9, 32738).
INFO:my_logger:Denormlizing ...the base is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 899.38it/s]
INFO:my_logger:Writing output file: data/scd_scaled.h5ad
scdenorm('data/scaled.mtx',fout='data/scd_scaled.h5ad',base=np.e, verbose=0,cutoff=0.05,method='Reg')
  0%|          | 0/9 [00:00<?, ?it/s]WARNING:my_logger:Warning:The cell 3 fails to denormlize, and be deleted
100%|██████████| 9/9 [00:00<00:00, 219.13it/s]

3.Testing scdenorm with wrong base

scdenorm('data/scaled.mtx',fout='data/scd_scaled.h5ad',base=2, verbose=1)
INFO:my_logger:Reading input file: data/scaled.mtx
INFO:my_logger:The dimensions of this data are (9, 32738).
INFO:my_logger:Failed to obtain scale factor. Error is 0.23859739923659193. x is [ 1.          1.79478553  2.49621531  3.13708999  3.73423045  4.29941127
  4.83767872  5.35576114  5.85365854  6.33809924  6.80571909  7.25988225
  7.71068124  8.14129521  8.57190917  8.98906644  9.39949537  9.80319596
 10.20016821 10.97392767 11.35071489 11.7275021  12.09083263 12.81749369
 13.17409588 13.52396972 13.87384357 14.22371741 14.56013457 14.89655172
 15.23296888 15.56938604 15.89234651 16.21530698 16.53826745 17.18418839
 17.49369218 17.80319596 18.42220353 18.73170732 19.0412111  20.81749369
 21.11354079 21.40958789 22.25735913 23.38772077 23.93944491 27.41126997
 28.9587889  31.67703953 32.8881413  43.06139613]
INFO:my_logger:The base 2 is not match with the data.

ERROR:my_logger:Denormlization has failed. Output the orignal data
INFO:my_logger:Writing output file: data/scd_scaled.h5ad

4.Testing scdenorm without base

scdenorm('data/scaled.mtx',fout='data/scd_scaled.h5ad',verbose=1)
INFO:my_logger:Reading input file: data/scaled.mtx
INFO:my_logger:The dimensions of this data are (9, 32738).
INFO:my_logger:Selecting base
INFO:my_logger:Denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 1280.36it/s]
INFO:my_logger:Writing output file: data/scd_scaled.h5ad
scdenorm('data/scaled.mtx', verbose=1)
INFO:root:Reading input file: data/scaled.mtx
INFO:root:The dimensions of this data are (9, 32738).
INFO:root:select base
INFO:root:denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 990.42it/s]
AnnData object with n_obs × n_vars = 9 × 32738
scdenorm(ad1, verbose=1)
INFO:root:The dimensions of this data are (9, 32738).
INFO:root:select base
INFO:root:denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 9/9 [00:00<00:00, 960.72it/s]
AnnData object with n_obs × n_vars = 9 × 32738

4.Testing scdenorm with auto detection

ad=sc.datasets.pbmc3k()
/home/huang_yin/anaconda3/envs/sc/lib/python3.9/site-packages/anndata/__init__.py:51: FutureWarning: `anndata.read` is deprecated, use `anndata.read_h5ad` instead. `ad.read` will be removed in mid 2024.
  warnings.warn(
ad=ad[:200].copy()
ad
AnnData object with n_obs × n_vars = 200 × 32738
    var: 'gene_ids'
sc.pp.normalize_total(ad, target_sum=1e4)
sc.pp.log1p(ad)
scdenorm(ad,verbose=1)
INFO:my_logger:The dimensions of this data are (200, 32738).
INFO:my_logger:Selecting base
INFO:my_logger:Denormlizing ...the base is 2.718281828459045
b is 2.718281828459045
100%|██████████| 200/200 [00:00<00:00, 1783.47it/s]
AnnData object with n_obs × n_vars = 200 × 32738
    var: 'gene_ids'
    uns: 'log1p'
scdenorm(ad,base='auto',verbose=1)
INFO:my_logger:The dimensions of this data are (200, 32738).
INFO:my_logger:Denormlizing ...the base is 2.718281957850356
      fun: 1.045837662921792e-10
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([5.11726853e-06, 1.49873895e-06])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 99
      nit: 31
     njev: 33
   status: 0
  success: True
        x: array([2.71828196, 0.99999995])
100%|██████████| 200/200 [00:00<00:00, 1013.93it/s]
AnnData object with n_obs × n_vars = 200 × 32738
    var: 'gene_ids'
    uns: 'log1p'
base,cont=auto_detect(ad.X,1e-3)
      fun: 6.749703574420865e-05
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([-0.28310845, -0.04961504])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 81
      nit: 26
     njev: 27
   status: 0
  success: True
        x: array([2.71767589, 1.00073907])
base
2.717675894469142